[pull] master from DataDog:master #592
Merged
* feat(nutanix): alert lifecycle tracking with per-state metrics
Replace the cursor-based alert collection with a reconciliation loop
against the v4.0 unresolved-alerts API. Ship per-state lifecycle gauges
and a default monitor template:
- nutanix.alert.open — 1 while alert is unresolved + unacknowledged
- nutanix.alert.acknowledged — 1 while alert is unresolved + acknowledged
- nutanix.alert.resolved — 1 once when alert enters the resolved state
State transitions emit explicit zeros to the previous state's metric so
per-alert monitor cases recover cleanly when the alert leaves a state.
Each metric carries ext_id for monitor grouping; the metric name itself
encodes the state, so monitor queries don't need a tag filter.
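
A minimal sketch of the "explicit zero on transition" idea described above. The function name, the tuple shape, and the state strings are assumptions made for illustration; they are not taken from the integration's code.

```python
from typing import List, Optional, Tuple

# Metric names come from the commit message; the mapping itself is illustrative.
STATE_METRICS = {
    "open": "nutanix.alert.open",
    "acknowledged": "nutanix.alert.acknowledged",
}

Sample = Tuple[str, int, List[str]]


def transition_gauges(ext_id: str, previous: Optional[str], current: str) -> List[Sample]:
    """Return the (metric, value, tags) samples for one alert in one cycle."""
    tags = [f"ext_id:{ext_id}"]
    samples: List[Sample] = []
    # Zero the previous state's gauge so a per-alert monitor on it recovers.
    if previous in STATE_METRICS and previous != current:
        samples.append((STATE_METRICS[previous], 0, tags))
    if current in STATE_METRICS:
        samples.append((STATE_METRICS[current], 1, tags))
    elif current == "resolved":
        # One-shot marker emitted when the alert enters the resolved state.
        samples.append(("nutanix.alert.resolved", 1, tags))
    return samples
```

For an open -> acknowledged transition this yields a 0 on nutanix.alert.open followed by a 1 on nutanix.alert.acknowledged, both tagged with the same ext_id.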
Lifecycle events (in addition to the metrics):
- "Alert: <title>" — created (or re-opened from resolved)
- "Alert acknowledged: <title>" — open -> acknowledged transition
- "Alert reopened: <title>" — acknowledged -> open transition
- "Alert Resolved: <title>" — resolution with resolvedTime / by /
auto_resolved metadata
Reconciliation is the source of truth each cycle: alerts in the API but
not in the in-memory cache are new (emit open event); alerts in the cache
but absent from the API are resolved or deleted (emit resolution event +
.open or .acknowledged = 0, .resolved = 1); alerts in both have their
cached metadata refreshed and ack-state transitions emit dedicated
events. No state is persisted between agent runs: an agent restart
re-derives it from the API, and the aggregation_key collapses any
duplicate creation events that become visible on restart.
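
The reconciliation above amounts to a set comparison between the in-memory cache and the API response. A rough sketch, with a hypothetical helper name and return shape:

```python
from typing import Dict, Iterable, Set


def classify_alerts(cached_ids: Iterable[str], api_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Split alert IDs into new, gone, and still-tracked groups."""
    cached, current = set(cached_ids), set(api_ids)
    return {
        "new": current - cached,            # in the API, not yet cached
        "gone": cached - current,           # cached, absent from the API
        "still_tracked": cached & current,  # present in both
    }
```

Computing all three sets up front, before touching the cache, keeps the later processing independent of iteration order.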
Hardening:
- on transient API failure, re-emit cached gauges before re-raising so
per-alert monitors don't auto-resolve while the alert is still open.
- pre-compute new/gone/still-tracked sets before mutating _open_alerts
so loop ordering is safe.
- v4.2 fallback removed; v4.0 endpoint with $filter=isResolved eq false
is the only path. The pre-existing client-side filter remains as a
safety net.
Tags added to alert events and metrics:
- ext_id, ntnx_alert_type, ntnx_alert_severity, ntnx_alert_status
(events only — redundant on metrics where the name encodes state)
- ntnx_originating_cluster_name, ntnx_alert_user_defined,
ntnx_alert_service (Tier 1 — distinguish federated cluster, custom
vs platform alerts, and Nutanix subsystem when present)
- ntnx_cluster_name, ntnx_alert_classification, ntnx_alert_impact,
ntnx_alert_auto_resolved (resolution events only), source-entity tags
Default monitor template at assets/monitors/alerts.json combines
nutanix.alert.open + nutanix.alert.acknowledged minus
nutanix.alert.resolved to alert on any unresolved alert (clamped to
non-negative). Auto-resolves on the resolved one-shot. Description
notes the agent-restart re-broadcast trade-off.
Test coverage: state transitions (open<->ack, ack->resolved from each
prior state), filter-add edge case (treated as spurious resolution),
deleted-alert (_get_alert returns None) graceful fallback, empty
unresolved list cold-start, and per-tag assertions for the new Tier 1
tags. The four "complete output" alertType tests are parametrized.
conftest mock has a _filter_after helper for the time-based fixture
branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): consolidate alert changelog entries
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): revert version bump and shorten changelog
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* working monitor
* feat(nutanix): heartbeat open alerts each cycle for event-based monitors
Re-emit one alert event per tracked alert per check cycle so event-count
monitors (last("Nm") > 0) stay firing while the alert is open. Transition
cycles skip the heartbeat — the dedicated transition event already lands
under the same aggregation_key. Resolved alerts are popped from
_open_alerts before the heartbeat loop, so they don't get a duplicate
heartbeat alongside their resolution event.
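
A sketch of the heartbeat selection described above, assuming the tracked alerts live in a dict keyed by alert ID and that the IDs which transitioned or resolved this cycle are already known. The helper name is hypothetical.

```python
from typing import Dict, List, Set


def heartbeat_targets(open_alerts: Dict[str, dict], transitioned: Set[str], resolved: Set[str]) -> List[str]:
    """Return the alert IDs that should get a heartbeat event this cycle."""
    # Resolved alerts leave the cache first, so they never get a heartbeat.
    for ext_id in resolved:
        open_alerts.pop(ext_id, None)
    # Alerts that transitioned already emitted an event under the same
    # aggregation_key, so they are skipped as well.
    return [ext_id for ext_id in open_alerts if ext_id not in transitioned]
```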
Ship the default monitor template at assets/monitors/alerts.json with a
real title and description.
Tests cover the new heartbeat skip-list (transitions, resolutions,
filter-exclusion), aggregation_key consistency across the full alert
lifecycle, and the cached-gauges-no-events contract on transient API
failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): revert version bump and simplify monitor threshold
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(nutanix): address review feedback on alert lifecycle tracking
- Drop the leaky self.alerts cache; _get_alert is now a one-shot fetch.
- Warn loudly when the client-side isResolved safety net drops alerts.
- Note that lastUpdatedTime is the closest signal for ack->open transitions.
- Fix triple-space in monitor title, fix date format to YYYY-MM-DD.
- Document the alert lifecycle and agent-restart behavior in the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): address alert tracking review feedback
- Repoint NutanixCheck.alerts at _open_alerts so the public property
matches what the activity monitor actually carries.
- Restore a per-cycle dedup cache on _get_alert so _process_task does
not issue N x M GETs for tasks referencing the same alerts.
- Extract _reconcile_alerts into named helpers (new, resolved,
transitioned, heartbeat, cached-gauge fallback) so the coordinator
reads cleanly.
- Namespace ext_id as ntnx_alert_ext_id on metrics, events, the
monitor template, and metadata.csv to avoid colliding with other
sources in the global ext_id tag.
- Log a warning when the unresolved-alerts list call fails, and
another when a gone alert cannot be fetched back from Prism Central.
- Switch nutanix.alert.resolved from gauge to count so a resolved
alert reopening with the same extId does not leave a stuck
resolved=1 series.
- Monitor template: query threshold > 0, use is_recovery consistently.
- Update record_fixtures.py to use the production isResolved eq false
filter so re-recorded fixtures match what the integration queries.
- Document the alert lifecycle, agent restart behavior, and recommended
metric monitor patterns in the README.
- Add a test for the resolved to open lifecycle with the same extId.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): quote metadata.csv description containing a comma
The nutanix.alert.resolved description contained an unquoted comma,
splitting the row into 12 columns at parse time and breaking metadata
validation.
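
The failure mode can be reproduced with Python's csv module. The row below is illustrative only: the description text is made up, and the 11-field layout is an assumption chosen so that the unquoted comma yields the 12 columns mentioned above.

```python
import csv
import io

# Hypothetical 11-field row; the sixth field contains a comma.
fields = [
    "nutanix.alert.resolved", "count", "", "", "",
    "Incremented once, when an alert resolves",
    "0", "nutanix", "alert resolved", "", "",
]

# Naive join: the comma inside the description is read as a delimiter.
naive_row = ",".join(fields)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
quoted_row = buffer.getvalue()

naive_cols = next(csv.reader(io.StringIO(naive_row)))
quoted_cols = next(csv.reader(io.StringIO(quoted_row)))
print(len(naive_cols), len(quoted_cols))  # prints "12 11"
```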
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(nutanix): correct alert reconciliation edge cases from review
Two distinct correctness fixes to the alert lifecycle tracking path:
- _get_alert no longer swallows every exception. A 404 still returns None
(the alert was deleted upstream), but transient HTTP failures (5xx,
network, timeout) propagate so they are not silently misclassified as
deletions and emit degraded "Resolved" events. _emit_resolved_alerts
catches the propagated HTTPError per-alert, restores tracking, and
retries on the next cycle. Returned count now reflects actual emissions.
- _reconcile_alerts now distinguishes alerts that truly left the
unresolved-alerts API (resolved/deleted upstream) from alerts that are
still open in Prism Central but no longer match the configured
resource_filters. The latter are dropped from tracking silently with an
info log; no resolution event or nutanix.alert.resolved increment is
emitted, since the alert is not resolved.
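
Both fixes can be sketched briefly. To keep the example self-contained, HTTPError and FakeResponse below are local stand-ins for requests.exceptions.HTTPError and its response object; get_alert and split_untracked are hypothetical names, not the integration's actual helpers.

```python
from typing import Callable, Optional, Set, Tuple


class FakeResponse:
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError (assumption for this sketch)."""

    def __init__(self, response: FakeResponse) -> None:
        super().__init__(f"HTTP {response.status_code}")
        self.response = response


def get_alert(fetch: Callable[[str], dict], ext_id: str) -> Optional[dict]:
    """Return the alert, None if it was deleted (404); re-raise anything else."""
    try:
        return fetch(ext_id)
    except HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 404:
            return None
        raise  # 5xx and similar failures are not deletions


def split_untracked(tracked: Set[str], unresolved: Set[str], matching: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Separate alerts that left the API from those merely filtered out."""
    gone = tracked - unresolved                       # resolved or deleted upstream
    filtered_out = (tracked & unresolved) - matching  # still open, no longer matched
    return gone, filtered_out
```

Only the first set would lead to a resolution event; the second is dropped from tracking without one.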
Test updates:
- New test_get_alert_returns_none_on_404, test_get_alert_propagates_on_transient_http_error[500/502/503/504], and test_transient_alert_get_failure_preserves_tracking.
- Existing test_alert_filter_excludes_tracked_alert_emits_spurious_resolution rewritten as test_alert_filter_excludes_tracked_alert_drops_without_resolution (pinned the prior bug; now asserts the correct behavior).
- New test_resolution_event_still_fires_when_alert_truly_leaves_unresolved_api guards the gone_ids semantics regression.
- conftest 404 mocks switched from bare Exception to requests.exceptions.HTTPError(response=...) so the 404 branch is actually exercised.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
* fix(nutanix): shorten monitor description to under 300 chars
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(nutanix): address review feedback on alert tracking
- Guard the best-effort _get_alert call in _process_task with try/except
HTTPError so a transient failure no longer aborts the task collection
cycle, matching the guard already used in _emit_resolved_alerts.
- Report currently-tracked open alerts in the check summary log instead
of the per-cycle change count, which read 0 on quiet cycles. Only the
INFO summary is affected; events and metrics are unchanged.
- Drop the leading underscore from the SEVERITY_TO_ALERT_TYPE constant.
- Move the fixture_alert helper to conftest.py and inline the
complete-output parametrize cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(nutanix): stamp open alert event at observation time
The open/heartbeat alert event was timestamped at the alert's
creationTime. Event monitors window on occurrence time, so back-dated
heartbeats never entered the recommended monitor's trailing 5m window:
the monitor fired once at creation, then auto-recovered ~5m later and
stayed OK regardless of the alert's real state in Prism Central.
Stamp the open/heartbeat event at observation time so it lands in the
monitor's rolling window. Recovery is now driven by heartbeats ceasing,
so the monitor recovers ~5m after the alert actually resolves.
Resolution and transition events keep their real timestamps; they are
not counted by the status:open query and only feed the timeline.
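
The windowing problem reduces to a timestamp comparison. A sketch, assuming a 5-minute trailing window and Unix timestamps in seconds; the function name is hypothetical.

```python
WINDOW_SECONDS = 5 * 60  # trailing window of the recommended monitor


def counted_by_monitor(event_ts: int, now: int, window: int = WINDOW_SECONDS) -> bool:
    """True if an event stamped at event_ts falls in the trailing window."""
    return now - window <= event_ts <= now
```

For an alert created an hour ago, a heartbeat stamped at creationTime falls outside the window, while one stamped at observation time falls inside it.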
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(nutanix): drop fixed changelog fragments
Keep only the added fragments on this branch; the fixed entries are
removed at the maintainer's request.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Janine Chan <64388808+janine-c@users.noreply.github.com>
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)